A version of Griffiths’s Diagnostic Articulation Test (DAT) using three-word items is described. The test
is applicable where monosyllabic words or sentence lists are undesirable or inappropriate. Each word of an
item is drawn from a separate set of five monosyllabic real words differing only in the initial or final
element. For each item, subjects underline one word in each of the three sets of words for that item on
the answer sheet. The test examines reception of 150 words in 7 min as compared with 50 words in 5 min
by the usual single-word format and preserves, moreover, the effects of gross temporal distortions and
masking that occur within and between words in consecutive discourse. Discrimination scores for the
tri-word test of intelligibility (TTI), as compared with the Modified Rhyme Test (MRT) and the C.I.D.
sentence lists, all taped with the same talker, were lower when tested in quiet and even relatively lower in
noise. The multiple-choice closed-set response permits easy administration, scoring, and analysis of
confusions; enunciation of three semantically unrelated words in coarticulatory succession preserves
interword transitions while limiting the effects of memory and linguistic redundancy. The use of the DAT
lists permits a somewhat more detailed analysis of errors than use of the three-word format with the
MRT as proposed by Williams et al. [Aviat. Space and Environm. Med. 47, 154-158 (1976)]. The
NSMRL TTI is proposed as a relatively [...] ease of administration and scoring are desirable.

PACS  numbers:  43.70.Ep 


INTRODUCTION 

The  ideal  speech  reception  and  discrimination  test 
for  evaluating  components  (talker,  channel,  listener)  of 
a communication  chain  must  meet  two  basic  require- 
ments: it  must  be  a valid  sample  of  the  speech  the 
chain  carries  or  is  about  to  carry,  and  it  must  make 
possible  an  analysis  of  errors.  Not  many  speech  tests 
approach  the  ideal  in  both  these  regards.  Dozens  of 
sentence  intelligibility  tests  have  been  constructed,  but 
these  are  always  cumbersome  to  administer  and  score, 
and  furthermore,  even  with  key  word  emphasis  within 
the  sentences  they  do  not  lend  themselves  readily  to 
error  analysis.  Dozens  of  single-word  intelligibility 
lists  have  been  constructed  which  are  quick  and  easy  to 
administer  and  score,  and  make  for  very  precise  error 
analysis,  but  represent  only  poorly  the  speech  material 
of  direct  interest. 

a)The opinions in this paper are solely the author’s and do not
necessarily represent the views of the U.S. Navy.
b)This paper is taken from portions of a previous report by J.
E. Atkinson, “Measuring Speech Intelligibility in a Multipath
Channel,” Tech. Rep. No. 5661, U.S. Naval Underwater
Systems Center, New London, CT 06320 (1 September 1977).
A preliminary draft of the construction of this test was
presented to NASA [R. L. Sergeant, “A Tri-Word Test of
Intelligibility of Speech,” Proc. Symp. in Speech Interference,
edited by W. Shepherd, NASA-TM-X-72696, NASA Langley
Research Center, Hampton, VA (1975)] and the Acoustical
Society of America [J. E. Atkinson, R. L. Sergeant, and P.
G. Lacroix, “Speech intelligibility in a stationary multipath
channel,” J. Acoust. Soc. Am. 58, S129(A) (1975)].
c)Present address: Department of Speech and Theater, Hunter
College, New York, NY 10021.


A major  difficulty  with  the  use  of  sentence  intelligibil- 
ity tests  is  that  the  variance  among  listeners’  responses 
depends very heavily on the match between each listener’s
condition (experience, intelligence, etc.) and the
vocabulary  and  content  of  the  message.  Single-word 
closed-set  response  tests,  in  which  all  allowable 
choices  are  given  on  the  answer  sheet,  reduce  this 
variance  so  far  as  possible.  On  the  other  hand,  tests 
with  single  word  lists  do  not  at  all  sample  the  acoustic 
and  prosodic  transitions  between  and  within  words  which 
are  so  much  a part  of  colloquial  speech. 


Carhart and Porter (1969) first proposed a compromise
combining the virtues of a single-word test with
the flow of sentential approximations. They tape-recorded
items of three words each, taken from the CNC
lists of Peterson and Lehiste (1962); this tape was used
successfully by Vargo (1977) in studying amplitude
compression in hearing aids. Speaks and Jerger (1965)
constructed ten-word sentential approximations which
progressively approached colloquial linguistic complexity.
Subjects responded by identifying in a multiple-choice
format the sentences which they heard. Although this
approach controls linguistic variables and eliminates
word-for-word writedown responses, it is, as the authors
point out, a test of sentence identification rather
than of intelligibility.

Beasley and Shriner (1973) followed Speaks and Jerger
in constructing word strings grouped into first-,
second-, and third-order sentences, in which each order
more closely approximated the linguistic constraints
of complete sentences. In their tests the words were
drawn from the 250 words of Griffiths’s (1967) Diagnostic
Articulation Test (DAT). Subjects wrote down all
words heard and remembered. This format is amenable
to rather complete error analysis, but there may be
appreciable serial order effects. At least for clinical
purposes, a shorter word-string would be preferable.

A real advance was made by Williams et al. (1974),
who adapted the multiple-word format to the closed-set
response feature of the Modified Rhyme Test (MRT) of
House et al. (1963). Strings of two and three words
were compared with the usual single-word format. For
each of the words in an item, subjects had an answer
sheet on which they could indicate which of six
possibilities they heard. The lists were recorded in the context
of an item number and carrier phrase: “One, do you
read sake? Over.” “Ten, do you read fit, cut? Over.”
“Fourteen, do you read saw, safe, hold? Over.” In
enunciating multiple-word lists the talker adopted a
manner and rhythm appropriate for a message, but he
tried to give “discrete productions” for each word. The
differences in scores for the one-, two-, and three-word
formats were shown to be negligible across speech-to-noise
ratios; savings were demonstrated in that, in the
three-word format, 150 words could be presented in 7
min, as against 50 words in the usual test in 5 min.

For many purposes, the Griffiths DAT has an advantage
over the MRT, in that the DAT incorporates
more difficult discriminations involving numbers of
instances where the foil words differ from the target word
in only one dimension; foils in the MRT could differ in
perhaps all three of the dimensions of place and manner
of articulation, and voicing. Care was taken in the
construction of the DAT to include all possible contrasts, and
as a result the DAT provides a more precise notion of
the underlying nature of a listener’s errors or of the
deficiencies of a channel in a communication chain.

In  our  work  a need  appeared  to  assess  a communica- 
tions chain  involving  time  smears  and  other  temporal 
distortions,  and  internal  masking  between  the  phonemes 
of  a sentence  both  within  and  across  words.  Complete 
and  accurate  analysis  of  errors  was  necessary,  and  yet 
the  material  had  to  be  approximately  of  sentence  length. 
We  combined  the  efficiency  of  the  three-word  format 
with  the  precision  of  error  analysis  which  the  DAT 
makes  possible,  and  created  the  NSMRL  tri-word  test 
of intelligibility (TTI).

Because  experience  is  limited  with  the  three-word 
format,  it  was  desirable  to  validate  the  TTI  by  com- 
paring it  on  the  same  subjects  and  under  the  same  gen- 
eral conditions  with  more  widely  used  intelligibility 
tests  before  offering  it  as  an  acceptable  test  of  speech 
intelligibility.  Since  the  TTI  partakes  of  the  advantages 
both of a single-word list and of a sentence list, it
should be validated against a sample of each type. We
chose the MRT of House et al. primarily because it
had been recommended by a National Research Council
committee (Kreul et al., 1968) as an archetype of tests
using  monosyllables  in  a closed-set  response  format. 

We  chose  the  C.I.D.  sentence  lists  (Silverman  and 
Davis,  1970)  because  they  had  been  written  following 
the  rules  laid  down  by  a National  Research  Council 
committee (Silverman and Hirsh, 1955) for constructing
samples  of  colloquial  speech. 

This paper describes in detail the construction of the
NSMRL  TTI  and  compares  listener  performance  using 
a selection  of  speech-to-noise  (S/N)  ratios  with  per- 
formance both  with  the  MRT  and  with  the  C.I.D.  sen- 
tence lists. 

I.  GENERAL  METHOD 
A.  Tri-word  test  of  intelligibility 

Each TTI list consisted of 50 three-word items, for
a total of 150 words. The three individual words in any
item were drawn at random from one of Griffiths’s five
DAT lists, with the restriction that each of the three
words come from a different DAT list. Griffiths’s lists
each contain 50 response sets as follows:

List     A        B        C        D        E

 1.      bat      batch    bask     bass     badge
 2.      laws     long     log      lodge    lob
13.      beige    base     bayed    bathe    bays
49.      mat      vat      that     fat      rat
50.      way      may      gay      they     nay

where  the  words  within  each  response  set  (Rows  1-50) 
differ  only  in  the  initial  or  in  the  final  consonant. 

To construct a TTI item, three words are randomly
selected (e.g., “badge-bayed-mat”) from lists E, C,
and A, respectively (TTI Lists 1, 2, and 3 are
available).1
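
The selection rule just described can be sketched in a few lines of Python. This is only an illustration: it uses the five response sets reproduced in the excerpt of Griffiths’s lists rather than all 50, the names `DAT_ROWS` and `make_tti_item` are hypothetical, and the requirement that the three words also come from three different response sets is an assumption suggested by the “badge-bayed-mat” example, not something the text states.

```python
import random

# Response sets (rows) reproduced in the excerpt of Griffiths's lists.
# Keys are row numbers; inner keys are the DAT lists A-E.
DAT_ROWS = {
    1:  {"A": "bat",   "B": "batch", "C": "bask",  "D": "bass",  "E": "badge"},
    2:  {"A": "laws",  "B": "long",  "C": "log",   "D": "lodge", "E": "lob"},
    13: {"A": "beige", "B": "base",  "C": "bayed", "D": "bathe", "E": "bays"},
    49: {"A": "mat",   "B": "vat",   "C": "that",  "D": "fat",   "E": "rat"},
    50: {"A": "way",   "B": "may",   "C": "gay",   "D": "they",  "E": "nay"},
}


def make_tti_item(rows, rng):
    """Draw three words, each from a different DAT list (A-E).

    Assumption (not stated in the text): the three words are also taken
    from three different response sets (rows).
    """
    lists = rng.sample(list("ABCDE"), 3)       # three different DAT lists
    row_ids = rng.sample(sorted(rows), 3)      # three different response sets
    words = [rows[r][lst] for r, lst in zip(row_ids, lists)]
    # The printed response set for each word is its full row, in shuffled order.
    response_sets = []
    for r in row_ids:
        choices = list(rows[r].values())
        rng.shuffle(choices)
        response_sets.append(choices)
    return words, lists, response_sets


rng = random.Random(1977)  # fixed seed so the sketch is repeatable
words, lists, response_sets = make_tti_item(DAT_ROWS, rng)
print("-".join(words), "drawn from lists", ", ".join(lists))
for choices in response_sets:
    print(choices)
```

Repeating the draw 50 times would give one list of 150 words; how repeated words across items were handled in the actual TTI lists is not described here, so the sketch leaves that open.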

For every tri-word item, five-word response sets
were printed in three columns, one for every word
position in the tri-word item. The listener’s task was
to cross out the word heard from each of the three sets.
For example, if the stimulus was “badge-bayed-mat,”
the three response sets for that item were as follows:


*badge     bathe    *mat
 batch     base      fat
 bass     *bayed     that
 bat       bays      rat
 bash      beige     vat

Correct answers are marked with an asterisk for the convenience of
the reader. An 8-s subject response interval was allowed
between items.
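
Because the response is closed-set, scoring reduces to comparing, for each word position, the word crossed out with the word spoken, and tallying the wrong choices as confusions. The sketch below shows one way this might be done; the function name `score_tti` and the two-item answer sheet are invented for illustration and are not data from the study.

```python
def score_tti(stimuli, responses):
    """Score tri-word items.

    stimuli, responses: lists of 3-tuples (spoken words, words crossed out).
    Returns percent correct per word position, overall percent correct,
    and a dictionary of (target, chosen) confusion counts.
    """
    n_items = len(stimuli)
    correct = [0, 0, 0]
    confusions = {}
    for spoken, chosen in zip(stimuli, responses):
        for pos, (target, answer) in enumerate(zip(spoken, chosen)):
            if answer == target:
                correct[pos] += 1
            else:
                key = (target, answer)
                confusions[key] = confusions.get(key, 0) + 1
    by_position = [100.0 * c / n_items for c in correct]
    overall = 100.0 * sum(correct) / (3 * n_items)
    return by_position, overall, confusions


# Hypothetical two-item answer sheet (illustrative only).
stimuli = [("badge", "bayed", "mat"), ("lob", "base", "that")]
responses = [("badge", "bathe", "mat"), ("lob", "base", "vat")]

by_position, overall, confusions = score_tti(stimuli, responses)
print(by_position)   # percent correct for 1st, 2nd, 3rd word
print(round(overall, 1))
print(confusions)
```

Keeping the confusions as (target, chosen) pairs is what allows the error analysis by phonetic contrast that the DAT response sets were designed to support.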

B.  Subjects 

Listeners  were  136  young  enlisted  men  drawn  at 
random  from  candidates  for  the  USN  Submarine  School. 
All had hearing threshold levels ≤ 15 dB from 0.5-8
kHz.
TABLE I. Mean discrimination score by listening panel and
list.

              Listening panel No.
              1        2        3        Difference in
List          (N=20)   (N=20)   (N=20)   mean score      Average

CID No. 1     X        99.3     99.2     0.1             99.25
CID No. 2     99.9     99.8     X        0.1             99.85
CID No. 3     99.6     X        99.0     0.6             99.3
MRT A         98.6     X        98.9     0.3             98.75
MRT B         95.0     96.2     X        1.2             95.6
MRT C         X        96.9     98.4     1.5             97.65
TTI No. 1     90.6     X        93.5     2.9             92.05
TTI No. 2     90.6     90.7     X        0.1             90.65
TTI No. 3     X        90.8     92.3     1.5             91.55
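
The last two columns of Table I, and the overall means quoted in the text for each test in quiet, follow directly from the panel means. The short check below re-derives them; the dictionary simply transcribes the two panel scores available for each list, and the names `TABLE_I` and `summarize` are illustrative.

```python
# The two panel means reported for each list in Table I (quiet condition).
TABLE_I = {
    "CID No. 1": (99.3, 99.2),
    "CID No. 2": (99.9, 99.8),
    "CID No. 3": (99.6, 99.0),
    "MRT A": (98.6, 98.9),
    "MRT B": (95.0, 96.2),
    "MRT C": (96.9, 98.4),
    "TTI No. 1": (90.6, 93.5),
    "TTI No. 2": (90.6, 90.7),
    "TTI No. 3": (90.8, 92.3),
}


def summarize(table):
    """Return interpanel differences per list and the mean DS per test."""
    diffs = {name: round(abs(a - b), 1) for name, (a, b) in table.items()}
    by_test = {}
    for name, (a, b) in table.items():
        test = name.split()[0]                 # "CID", "MRT", or "TTI"
        by_test.setdefault(test, []).append((a + b) / 2)
    means = {test: round(sum(v) / len(v), 1) for test, v in by_test.items()}
    return diffs, means


diffs, means = summarize(TABLE_I)
print(means)
print(sum(d > 1.5 for d in diffs.values()), "of", len(diffs),
      "comparisons exceed 1.5 points")
```

Differences are rounded to one decimal before the comparison with 1.5 so that floating-point residue does not turn a tabled value of 1.5 into a spurious exceedance.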

C. Stimuli

Master tapes were made using a high quality microphone
in an anechoic chamber and an Ampex PR-10
tape recorder. The talker (RLS) was a man experienced
in intelligibility testing who spoke with a general
American dialect. Three TTI tests were recorded,
three of the MRT, and three of the C.I.D. sentences.
Tri-word items were spoken as monotonic three-word
phrases, not as “discrete productions.” A calibration
tone at 1 kHz was recorded on each tape.

For the one experiment in which speech material was
mixed with noise at a selection of speech-to-noise
ratios, the ratios were determined as follows: prior
to presentation, measurements using a General Radio
graphic level recorder were made of each item at the
input to the earphones. The mean item level for each
list was computed and used in determining S/N.
Speech-spectrum noise from a General Radio 901B noise
generator was mixed with the speech in an appropriate
circuit and adjusted to give the desired S/N for any
condition (for S/N’s used, see Table II).

Seven listening panels of approximately 20 men each
were composed. All were presented with materials
monaurally in a room fitted with 20 matched Permoflux
PDR-8 earphones in MX-41/AR cushions. Presentation
level for all tests was set by adjusting calibration tones
to 70 dB SPL, as determined for the median phone of
the room in an NBS 9-A coupler. Listeners marked
multiple-choice answer sheets for the MRT and TTI,
and wrote down responses to the sentences on a blank
sheet of paper.

II. EXPERIMENT I

Three listening panels were each presented with two
lists of each test in quiet (see experimental design in
Table I). This provided both baseline data and an
indication of intergroup reliability for every list.

A. Results and discussion

Table I shows mean discrimination score (DS) and
mean differences between panels for all conditions (see
also Fig. 1). Overall mean DS’s in quiet (last column)
show that, in general, the TTI is a more difficult test
(mean DS = 91.4) than either the sentences (mean DS
= 99.5) or the MRT (mean DS = 97.3).

The mean DS difference between panels (column 4)
indicates that interpanel variability was relatively
small; only 1 of 9 comparisons between panels
exceeded 1.5 percentage points difference.

III. EXPERIMENT II

Order of presentation of lists, S/N ratios, and listening
panels were randomized with four listening panels.
Based on preliminary testing, different levels of noise
were mixed with the various speech materials to equate
degree of difficulty (see Fig. 1). Each panel heard six
test S/N conditions.

A. Results and discussion

Table II lists the mean DS and standard deviations
for all conditions. Figure 1 presents mean DS
corrected for chance.


[Figure 1. Abscissa: speech-to-noise ratio in decibels, from -20 to +15, and quiet.]

FIG. 1. Mean per cent correct intelligibility scores (corrected
for chance) as a function of speech-to-noise ratio for three
speech materials. Crosses: C.I.D. sentences. Filled circles:
MRT. Open circles: TTI. Scores in per cent correct
are not directly comparable for the three types of test. On the
TTI, a random (guessing) score would be 20% correct, while
on the MRT it would be 16.7% correct and 0% for the sentences.
In order to render the data comparable across tests, raw
scores were adjusted by the following formulas:

Sentences: Score corrected for chance = 0.1 (X - 0),
MRT: Score corrected for chance = 0.12 (X - 16.7), and
TTI: Score corrected for chance = 0.125 (X - 20),

where X is percent correct response. These formulas transform
the raw scores to a common scale ranging from 0 for
completely random responses to 10 for a perfect score. Scores
corrected for chance permit a more meaningful comparison
of the types of test.
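
The three correction formulas in the caption share one form: subtract the guessing score and rescale so that a perfect raw score maps to 10. Each coefficient is approximately 10/(100 - chance), which gives 0.125 for five alternatives and about 0.12 for six. A minimal sketch using the caption’s own constants follows; the names `CHANCE_CORRECTION` and `corrected_score` are illustrative.

```python
# (scale, chance level in percent) exactly as given in the Fig. 1 caption.
CHANCE_CORRECTION = {
    "sentences": (0.1, 0.0),    # open set: guessing yields 0%
    "MRT": (0.12, 16.7),        # six alternatives
    "TTI": (0.125, 20.0),       # five alternatives
}


def corrected_score(test, percent_correct):
    """Map a raw percent-correct score onto the common 0-10 scale."""
    scale, chance = CHANCE_CORRECTION[test]
    return scale * (percent_correct - chance)


for test in CHANCE_CORRECTION:
    _, chance = CHANCE_CORRECTION[test]
    print(test,
          round(corrected_score(test, chance), 2),   # pure guessing
          round(corrected_score(test, 100.0), 2))    # perfect score
```

Because the MRT coefficient is rounded to 0.12, a perfect MRT score lands very slightly below 10 rather than exactly on it.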
TABLE II. Mean discrimination scores and standard deviations
as a function of speech-to-noise ratio for experiment II.

Test             S/N in dB    Mean DS    S.D.

CID sentences    -20           5.3        5.5
                 -15          52.8       14.7
                 -10          64.8       12.7
                  -5          90.2       10.4
                   0          97.3        3.1
                  +5          98.0        4.2

MRT              -15          18.6       11.4
                 -10          47.4        9.5
                  -5          54.7        7.9
                   0          70.2        7.5
                  +5          84.1        5.2
                 +10          86.8        4.8

                              Mean DS, by word        S.D., by word
Test             S/N in dB    1st    2nd    3rd       1st    2nd    3rd

TTI              -10          34.6   36.7   36.9       6.8    6.7    7.8
                  -5          44.4   39.6   42.7       7.7    6.5    6.0
                   0          55.4   49.6   59.6       7.4    8.3   10.2
                  +5          74.6   68.6   75.6      11.2    8.8    7.9
                 +10          81.1   78.8   85.6       8.1    7.5    4.9


The  sentences  are  of  course  more  intelligible  in 
noise,  due  to  their  greater  inherent  redundancy,  than 
either  the  MRT  or  TTI.  The  TTI  is  the  least  intelli- 
gible as  a function  of  S/N,  presumably  as  a result  of 
greater  complexity  both  of  task  and  perceptual  process- 
ing. 

Since  subject-by-subject  variability  is  an  index  of 
test  reliability,  it  is  useful  to  compare  variability  about 
the  mean  at  a point  of  equal  difficulty  for  each  test. 

S/N in dB for the mean 50%-correct intelligibility points
were therefore determined: these were -14.5 for the
sentences, -5.2 for the MRT, and +1.0 for the TTI
(see Fig. 1). The experimental conditions corresponding
most closely to these points were -15 for sentences,
-5 for the MRT, and 0 for the TTI. The associated
standard deviations (see Table II) were 14.7 for
sentences, 7.9 for the MRT, and 8.6 for the TTI.
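
The comparison above amounts to finding, for each test, the tested S/N nearest its 50% point and reading off the standard deviation at that condition. The sketch below does this with the Table II values. Averaging the three word-position standard deviations for the TTI is an assumption on our part; it reproduces the 8.6 quoted in the text, but the text does not say how that figure was obtained. The 50% points are taken as given rather than recomputed, and all names are illustrative.

```python
# Standard deviations from Table II, keyed by S/N in dB.
# TTI entries hold the S.D. for the 1st, 2nd, and 3rd word.
SD_BY_SN = {
    "CID sentences": {-20: (5.5,), -15: (14.7,), -10: (12.7,),
                      -5: (10.4,), 0: (3.1,), 5: (4.2,)},
    "MRT": {-15: (11.4,), -10: (9.5,), -5: (7.9,),
            0: (7.5,), 5: (5.2,), 10: (4.8,)},
    "TTI": {-10: (6.8, 6.7, 7.8), -5: (7.7, 6.5, 6.0), 0: (7.4, 8.3, 10.2),
            5: (11.2, 8.8, 7.9), 10: (8.1, 7.5, 4.9)},
}

# 50%-correct points (S/N in dB) as stated in the text.
FIFTY_PERCENT_POINT = {"CID sentences": -14.5, "MRT": -5.2, "TTI": 1.0}


def sd_near_midpoint(test):
    """Return (nearest tested S/N, S.D. at that S/N) for one test."""
    target = FIFTY_PERCENT_POINT[test]
    conditions = SD_BY_SN[test]
    nearest = min(conditions, key=lambda sn: abs(sn - target))
    sds = conditions[nearest]
    # Assumption: for the TTI, average the three word-position S.D.s.
    return nearest, round(sum(sds) / len(sds), 1)


for test in SD_BY_SN:
    print(test, sd_near_midpoint(test))
```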

For all practical purposes, the variability about the
mean  is  no  greater  for  the  TTI  than  for  the  MRT,  but  it 
is  considerably  less  than  for  the  sentences.  The  fact 
that  sentences  provide  a less  stable  mean  DS  is  simply 
a reflection  of  the  greater  variance  associated  with  the 
linguistic  complexity  of  the  material,  a feature  reduced 
intentionally  in  the  other  tests. 

The variability inherent within a speech test derives
from two sources, (1) response complexity and (2) task
complexity; the former refers to the influence of open
versus closed-set response biasing, and the latter to
the perceptual processing demand which a task places
on the listener. The CID sentences in this scheme are
higher in response complexity by virtue of the open-set
response format, but lower in task complexity due to
inherent linguistic redundancy. The converse is true
for the MRT and TTI; however, for the TTI there is an
additional variable which contributes to task complexity,
that of serial order effects. Since the listener must
perceive and store three words in succession and then
recall them in order, the task variable is more
complicated than in the single-word MRT. Overall,
the TTI is a more suitable test of intelligibility because
it presents the listener with a longer and more difficult
task while limiting response complexity. This provides
less variability and lower scores, resulting in a stable
test without “ceiling effects.”